fix: mypy errors and flaky subscriber/publisher tests#1258
Conversation
I love how the AI just assumed that's the case without any proof. The problem seems to be more complicated than that. |
Yeah, agreed. It's churning now on using a more deterministic solution (at my prompting) like |
e90ddf5 to
35e5b6d
Compare
|
I think it did it... |
|
Is the first commit needed to make the tests pass? |
It is, yes. That commit is what makes up #1255, which still needs to go into |
Two independent root causes block any PR that touches a Rolling/Lyrical job:
mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent
issues older versions silently accepted):
- subscribe.py: extend the type-ignore on the simplejson fallback to also
cover no-redef. The chained 'import as encode_json' pattern legitimately
rebinds the name; newer mypy now requires the suppression to say so.
- websocket_handler.py: narrow Node | None at _log_exception's use site
with the same assert pattern already in open()/on_close(); annotate the
bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional
request.remote_ip to a str at the ClientManager.{add,remove}_client
boundary (Tornado types it as Any | None).
Flaky topic-existence tests: the previous helpers polled the rmw graph
cache via get_(publisher|subscriber)_names_and_types_by_node, which
requires a DDS-discovery round-trip even for self-queries. On 'resolute'
that round-trip routinely overran the 1s polling deadline, so different
tests flaked across different runs.
Switch to a deterministic implementation:
- is_topic_published / is_topic_subscribed in util/ros.py now read
node.publishers / node.subscriptions (the local entity list). These are
populated synchronously inside create_publisher / create_subscription
and cleared inside destroy_publisher / destroy_subscription, so the
result is correct as soon as the calling thread has the entity handle.
- Drop the assert_topic_(not_)subscribed polling helpers from the
subscriber tests; call is_topic_subscribed directly.
- subscribers.py routes destroy_subscription through executor.create_task
(RobotWebTools#1194), so after manager.unsubscribe / MultiSubscriber.unregister the
test thread still has to synchronize with the executor before asserting
the entity is gone. Add wait_for_executor_idle, which enqueues a no-op
task and waits for it; SingleThreadedExecutor processes tasks FIFO, so
any destroy task enqueued earlier has completed when it returns.
- Drop the time.sleep(0.1) before is_topic_published in
test_publisher_manager.test_register_infer_topictype — create_publisher
is now reflected immediately.
Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those
are waiting for threading.Timer to fire and run destroy_publisher, not
for graph propagation.
test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical even though pytest itself completes in ~14s and reports 46/46 passing. The remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon worker threads. Each of the action / service capability tests dispatches the rosbridge worker via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to the production protocol.register_operation path). The worker calls SendGoal.send_goal / call_service.call_service, which busy-wait on a result that is only delivered when the executor spins the registered callback. When tearDown calls executor.shutdown() before the worker's result callback has fired, the busy-wait runs forever. Because the dispatch threads are non-daemon, the Python interpreter blocks at shutdown trying to join them. Mark these worker threads daemon=True so process exit is not gated on them. The threads still complete normally on the happy path; daemon only changes behaviour at interpreter shutdown, mirroring the pattern already used in test_stress_service_clients.py.
b58fb2b to
4a2dc1c
Compare
|
@Mergifyio backport kilted |
✅ Backports have been createdDetails
Cherry-pick of efc9f37 has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
#1261) * fix: mypy errors and flaky subscriber/publisher tests (#1258) * chore: fix pre-existing Rolling/Lyrical CI failures Two independent root causes block any PR that touches a Rolling/Lyrical job: mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent issues older versions silently accepted): - subscribe.py: extend the type-ignore on the simplejson fallback to also cover no-redef. The chained 'import as encode_json' pattern legitimately rebinds the name; newer mypy now requires the suppression to say so. - websocket_handler.py: narrow Node | None at _log_exception's use site with the same assert pattern already in open()/on_close(); annotate the bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional request.remote_ip to a str at the ClientManager.{add,remove}_client boundary (Tornado types it as Any | None). Flaky topic-existence tests: the previous helpers polled the rmw graph cache via get_(publisher|subscriber)_names_and_types_by_node, which requires a DDS-discovery round-trip even for self-queries. On 'resolute' that round-trip routinely overran the 1s polling deadline, so different tests flaked across different runs. Switch to a deterministic implementation: - is_topic_published / is_topic_subscribed in util/ros.py now read node.publishers / node.subscriptions (the local entity list). These are populated synchronously inside create_publisher / create_subscription and cleared inside destroy_publisher / destroy_subscription, so the result is correct as soon as the calling thread has the entity handle. - Drop the assert_topic_(not_)subscribed polling helpers from the subscriber tests; call is_topic_subscribed directly. - subscribers.py routes destroy_subscription through executor.create_task (#1194), so after manager.unsubscribe / MultiSubscriber.unregister the test thread still has to synchronize with the executor before asserting the entity is gone. Add wait_for_executor_idle, which enqueues a no-op task and waits for it; SingleThreadedExecutor processes tasks FIFO, so any destroy task enqueued earlier has completed when it returns. - Drop the time.sleep(0.1) before is_topic_published in test_publisher_manager.test_register_infer_topictype — create_publisher is now reflected immediately. Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those are waiting for threading.Timer to fire and run destroy_publisher, not for graph propagation. * test: daemonize worker threads in capability tests test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical even though pytest itself completes in ~14s and reports 46/46 passing. The remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon worker threads. Each of the action / service capability tests dispatches the rosbridge worker via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to the production protocol.register_operation path). The worker calls SendGoal.send_goal / call_service.call_service, which busy-wait on a result that is only delivered when the executor spins the registered callback. When tearDown calls executor.shutdown() before the worker's result callback has fired, the busy-wait runs forever. Because the dispatch threads are non-daemon, the Python interpreter blocks at shutdown trying to join them. Mark these worker threads daemon=True so process exit is not gated on them. The threads still complete normally on the happy path; daemon only changes behaviour at interpreter shutdown, mirroring the pattern already used in test_stress_service_clients.py. (cherry picked from commit efc9f37) # Conflicts: # rosbridge_library/test/capabilities/test_action_capabilities.py * Fix conflicts --------- Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com> Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
#1261) (#1262) * fix: mypy errors and flaky subscriber/publisher tests (#1258) * chore: fix pre-existing Rolling/Lyrical CI failures Two independent root causes block any PR that touches a Rolling/Lyrical job: mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent issues older versions silently accepted): - subscribe.py: extend the type-ignore on the simplejson fallback to also cover no-redef. The chained 'import as encode_json' pattern legitimately rebinds the name; newer mypy now requires the suppression to say so. - websocket_handler.py: narrow Node | None at _log_exception's use site with the same assert pattern already in open()/on_close(); annotate the bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional request.remote_ip to a str at the ClientManager.{add,remove}_client boundary (Tornado types it as Any | None). Flaky topic-existence tests: the previous helpers polled the rmw graph cache via get_(publisher|subscriber)_names_and_types_by_node, which requires a DDS-discovery round-trip even for self-queries. On 'resolute' that round-trip routinely overran the 1s polling deadline, so different tests flaked across different runs. Switch to a deterministic implementation: - is_topic_published / is_topic_subscribed in util/ros.py now read node.publishers / node.subscriptions (the local entity list). These are populated synchronously inside create_publisher / create_subscription and cleared inside destroy_publisher / destroy_subscription, so the result is correct as soon as the calling thread has the entity handle. - Drop the assert_topic_(not_)subscribed polling helpers from the subscriber tests; call is_topic_subscribed directly. - subscribers.py routes destroy_subscription through executor.create_task (#1194), so after manager.unsubscribe / MultiSubscriber.unregister the test thread still has to synchronize with the executor before asserting the entity is gone. Add wait_for_executor_idle, which enqueues a no-op task and waits for it; SingleThreadedExecutor processes tasks FIFO, so any destroy task enqueued earlier has completed when it returns. - Drop the time.sleep(0.1) before is_topic_published in test_publisher_manager.test_register_infer_topictype — create_publisher is now reflected immediately. Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those are waiting for threading.Timer to fire and run destroy_publisher, not for graph propagation. * test: daemonize worker threads in capability tests test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical even though pytest itself completes in ~14s and reports 46/46 passing. The remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon worker threads. Each of the action / service capability tests dispatches the rosbridge worker via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to the production protocol.register_operation path). The worker calls SendGoal.send_goal / call_service.call_service, which busy-wait on a result that is only delivered when the executor spins the registered callback. When tearDown calls executor.shutdown() before the worker's result callback has fired, the busy-wait runs forever. Because the dispatch threads are non-daemon, the Python interpreter blocks at shutdown trying to join them. Mark these worker threads daemon=True so process exit is not gated on them. The threads still complete normally on the happy path; daemon only changes behaviour at interpreter shutdown, mirroring the pattern already used in test_stress_service_clients.py. (cherry picked from commit efc9f37) # Conflicts: # rosbridge_library/test/capabilities/test_action_capabilities.py * Fix conflicts --------- (cherry picked from commit 65034c7) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com> Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
#1263) * fix: mypy errors and flaky subscriber/publisher tests (backport #1258) (#1261) * fix: mypy errors and flaky subscriber/publisher tests (#1258) * chore: fix pre-existing Rolling/Lyrical CI failures Two independent root causes block any PR that touches a Rolling/Lyrical job: mypy (Ubuntu 25.10 'resolute' ships a stricter mypy that flags 5 latent issues older versions silently accepted): - subscribe.py: extend the type-ignore on the simplejson fallback to also cover no-redef. The chained 'import as encode_json' pattern legitimately rebinds the name; newer mypy now requires the suppression to say so. - websocket_handler.py: narrow Node | None at _log_exception's use site with the same assert pattern already in open()/on_close(); annotate the bare protocol_parameters ClassVar as dict[str, Any]; coerce the optional request.remote_ip to a str at the ClientManager.{add,remove}_client boundary (Tornado types it as Any | None). Flaky topic-existence tests: the previous helpers polled the rmw graph cache via get_(publisher|subscriber)_names_and_types_by_node, which requires a DDS-discovery round-trip even for self-queries. On 'resolute' that round-trip routinely overran the 1s polling deadline, so different tests flaked across different runs. Switch to a deterministic implementation: - is_topic_published / is_topic_subscribed in util/ros.py now read node.publishers / node.subscriptions (the local entity list). These are populated synchronously inside create_publisher / create_subscription and cleared inside destroy_publisher / destroy_subscription, so the result is correct as soon as the calling thread has the entity handle. - Drop the assert_topic_(not_)subscribed polling helpers from the subscriber tests; call is_topic_subscribed directly. - subscribers.py routes destroy_subscription through executor.create_task (#1194), so after manager.unsubscribe / MultiSubscriber.unregister the test thread still has to synchronize with the executor before asserting the entity is gone. Add wait_for_executor_idle, which enqueues a no-op task and waits for it; SingleThreadedExecutor processes tasks FIFO, so any destroy task enqueued earlier has completed when it returns. - Drop the time.sleep(0.1) before is_topic_published in test_publisher_manager.test_register_infer_topictype — create_publisher is now reflected immediately. Existing time.sleep(manager.unregister_timeout + 1.0) waits stay: those are waiting for threading.Timer to fire and run destroy_publisher, not for graph propagation. * test: daemonize worker threads in capability tests test_capabilities was hitting CTest's 60-second timeout on Rolling and Lyrical even though pytest itself completes in ~14s and reports 46/46 passing. The remaining ~46s is wasted in Python interpreter shutdown waiting on non-daemon worker threads. Each of the action / service capability tests dispatches the rosbridge worker via Thread(target=self.send_goal.send_action_goal, ...).start() (analogous to the production protocol.register_operation path). The worker calls SendGoal.send_goal / call_service.call_service, which busy-wait on a result that is only delivered when the executor spins the registered callback. When tearDown calls executor.shutdown() before the worker's result callback has fired, the busy-wait runs forever. Because the dispatch threads are non-daemon, the Python interpreter blocks at shutdown trying to join them. Mark these worker threads daemon=True so process exit is not gated on them. The threads still complete normally on the happy path; daemon only changes behaviour at interpreter shutdown, mirroring the pattern already used in test_stress_service_clients.py. (cherry picked from commit efc9f37) # Conflicts: # rosbridge_library/test/capabilities/test_action_capabilities.py * Fix conflicts --------- Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com> Co-authored-by: Błażej Sowa <bsowa123@gmail.com> (cherry picked from commit 65034c7) # Conflicts: # rosbridge_library/src/rosbridge_library/capabilities/subscribe.py # rosbridge_library/test/internal/subscribers/test_multi_subscriber.py # rosbridge_server/src/rosbridge_server/websocket_handler.py * Fix conflicts --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
Summary
Pre-existing CI hygiene on Rolling and Lyrical that has been red on
ros2since the move toubuntu:resolute:resolute's package set) flags 5 latent typing issues that older mypy let slide — one inrosbridge_library/capabilities/subscribe.py, four inrosbridge_server/websocket_handler.py. None are caused by the changes in this branch; they are simply now load-bearing for any PR that touches a Rolling/Lyrical job.resoluteoverran the topic-existence test deadlines. The helpers inrosbridge_library/util/ros.pypolled the rmw graph cache viaget_(publisher|subscriber)_names_and_types_by_node, which requires a discovery round-trip even for self-queries. Different subscriber / publisher tests flaked across different runs.Dependency
This PR stacks on top of #1255 (the service / action lifecycle race fix). #1255 must land first; without it
test_action_capabilitiesSIGSEGVs and the rest of the Rolling/Lyrical job is moot.Approach
mypy
Smallest possible patches at each error site — no broader refactor, no library-wide annotation pass.
subscribe.py:61— extend thetype: ignoreon thesimplejsonfallback to also coverno-redef. The chainedfrom X import Y as encode_jsonpattern legitimately rebinds the name; the new mypy version requires the suppression to say so.websocket_handler.py::_log_exception— narrowRosbridgeWebSocket.node_handlewith the sameassert isinstance(..., Node)pattern already used inopen()/on_close()before calling.get_logger().websocket_handler.py::protocol_parameters— annotate the bareClassVarasClassVar[dict[str, Any]]to match the value it's actually assigned inrosbridge_websocket.py.websocket_handler.py::open/on_close— passself.request.remote_ip or ""toClientManager.add_client/remove_client. Tornado typesremote_ipasAny | None; the manager signature isstr.Topic-existence checks
The polling-based
assert_topic_(not_)subscribedhelpers, and the rmw-graph-cache queries they were built around, were always going to flake. The graph cache is updated by DDS discovery events; even for a node querying its own subscriptions, that's an asynchronous round-trip, and theresolutepackage set's defaults make the round-trip slower.The deterministic alternative — and the one this PR adopts — is to read the node's own entity list directly:
is_topic_published(node, name)→any(p.topic_name == name for p in node.publishers)is_topic_subscribed(node, name)→any(s.topic_name == name for s in node.subscriptions)Both
node.publishersandnode.subscriptionsare populated synchronously insidecreate_publisher/create_subscriptionand cleared insidedestroy_publisher/destroy_subscription, so the result is correct as soon as the calling thread has the entity handle — no DDS involvement.The one remaining asynchrony is that
MultiSubscriber.unregisterroutesdestroy_subscriptionthroughexecutor.create_task(added in #1194). Aftermanager.unsubscribe, the destroy task is queued but may not have run yet from the test thread's point of view.wait_for_executor_idlehandles this: it submits a no-op task and waits for it to run.SingleThreadedExecutorprocesses tasks FIFO, so when the no-op completes every task enqueued before it has also completed.Why not
launch_testing?launch_testingis for testing systems that span processes. The tests here exercise the in-processSubscriberManager/PublisherManagerAPI on a single node — moving them underlaunch_testingwould impose a per-test process boundary and a much larger refactor without changing the determinism story.node.publishers/node.subscriptionsgive the same guarantee with no new infrastructure.What changed
rosbridge_library/util/ros.py— rewriteis_topic_published/is_topic_subscribedagainstnode.publishers/node.subscriptions; addwait_for_executor_idle.assert_topic_(not_)subscribedpolling helpers from the two subscriber test files; replace with directassertTrue/assertFalseagainstis_topic_subscribed. Addwait_for_executor_idle(self.executor)after eachmanager.unsubscribe/multi.unregisterbefore asserting the entity is gone.time.sleep(0.1)beforeis_topic_publishedintest_publisher_manager.test_register_infer_topictype—create_publisheris now reflected immediately. Thetime.sleep(manager.unregister_timeout + 1.0)waits stay; those wait onthreading.Timer, not graph propagation.subscribe.py,websocket_handler.py— the mypy fixes described above.Local validation
Reproduced on a
ros:rolling-perceptioncontainer with current rclpy:test/internal/subscribers/+test/internal/publishers/— 44/44 pass (the previously flakytest_register_infer_topictype/test_register_subscriber_multiclientetc. now resolve as soon ascreate_subscriptionreturns instead of polling).test/capabilities/— 46/46 pass with fix: Prevent client destruction race in services.call_service #1255 applied (without fix: Prevent client destruction race in services.call_service #1255,test_action_capabilitiesSIGSEGVs before reaching anything in this PR's scope).